Near neighbor searching with K nearest references
نویسندگان
چکیده
Proximity searching is the problem of retrieving, from a given database, those objects closest to a query. To avoid exhaustive searching, data structures called indexes are built on the database prior to serving queries. The curse of dimensionality is a well-known problem for indexes: in spaces with sufficiently concentrated distance histograms, no index outperforms an exhaustive scan of the database. In recent years, a number of indexes for approximate proximity searching have been proposed. These are able to cope with the curse of dimensionality in exchange for returning an answer that might be slightly different from the correct one. In this paper we show that many of those recent indexes can be understood as variants of a simple general model based on K-nearest reference signatures. A set of references is chosen from the database, and the signature of each object consists of theK references nearest to the object. At query time, the signature of the query is computed and the search examines only the objects whose signature is close enough to that of the query. Many known and novel indexes are obtained by considering different ways to determine how much detail the signature records (e.g., just the set of nearest references, or also their proximity order to the object, or also their distances to the object, and so on), how the similarity between signatures is defined, and how the parameters are tuned. In addition, we introduce a space-efficient representation for those families of indexes, making it possible to search very large databases in main memory. We perform exhaustive experiments comparing several known and new indexes that derive from our framework, evaluating their time performance, memory usage, and quality of approximation. The best indexes outperform the state Email addresses: [email protected] (E. Chávez), [email protected] (M. Graff), [email protected] (G. Navarro), [email protected] (E.S. Téllez) 1Partially funded by (CONACyT grant 179795) Mexico 2Funded with basal funds FB0001, Conicyt, Chile Preprint submitted to Elsevier December 3, 2014 of the art, offering an attractive balance between all these aspects, and turn out to be excellent choices in many scenarios. Our framework gives high flexibility to design new indexes.
منابع مشابه
Non-zero probability of nearest neighbor searching
Nearest Neighbor (NN) searching is a challenging problem in data management and has been widely studied in data mining, pattern recognition and computational geometry. The goal of NN searching is efficiently reporting the nearest data to a given object as a query. In most of the studies both the data and query are assumed to be precise, however, due to the real applications of NN searching, suc...
متن کاملAn Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification
The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...
متن کاملAn Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification
The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...
متن کاملFUZZY K-NEAREST NEIGHBOR METHOD TO CLASSIFY DATA IN A CLOSED AREA
Clustering of objects is an important area of research and application in variety of fields. In this paper we present a good technique for data clustering and application of this Technique for data clustering in a closed area. We compare this method with K-nearest neighbor and K-means.
متن کاملA Parallel Algorithms on Nearest Neighbor Search
The (k-)nearest neighbor searching has very high computational costs. The algorithms presented for nearest neighbor search in high dimensional spaces have have suffered from curse of dimensionality, which affects either runtime or storage requirements of the algorithms terribly. Parallelization of nearest neighbor search is a suitable solution for decreasing the workload caused by nearest neigh...
متن کاملA Survey of Techniques for Fixed Radius near Neighbor Searching . . . 4 . . . 18
I ARS'IRACT This paper is a survey of techniques used for searching in a multidimensicnal space. Though we ccnsider specifically the Froblem of searching for fixed radius near neighbors (that is, all points within a fixed distance of a given point), the structures presented bere are applicable to eiany different search problems in multidimensional spaces. The orientation of this paper is practi...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Inf. Syst.
دوره 51 شماره
صفحات -
تاریخ انتشار 2015